Multilingual machine translation models can benefit from synergy between different language pairs, but also suffer from interference. While there is a growing number of sophisticated methods that aim to eliminate interference, our understanding of interference as a phenomenon is still limited. This work identifies the main factors that contribute to interference in multilingual machine translation. Through systematic experimentation, we find that interference (or synergy) is primarily determined by model size, data size, and the proportion of each language pair within the total dataset. We observe that substantial interference occurs mainly when the model is very small with respect to the available training data, and that using standard transformer configurations with fewer than one billion parameters largely alleviates interference and promotes synergy. Moreover, we show that tuning the sampling temperature to control the proportion of each language pair in the data is key to balancing the amount of interference between low- and high-resource language pairs effectively, and can lead to superior performance overall.
translated by Google Translate
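The role of the sampling temperature can be illustrated with a minimal sketch. It assumes the common convention in which a language pair with $n_i$ training examples is sampled with probability proportional to $(n_i / \sum_j n_j)^{1/T}$; the abstract does not state the exact formula, and the function name and the dataset sizes below are hypothetical.

```python
def sampling_proportions(pair_sizes, temperature):
    """Return sampling probabilities p_i proportional to (n_i / N) ** (1 / T).

    Assumed convention (not taken from the abstract): T = 1 samples in
    proportion to the data size, and larger T moves towards uniform sampling.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(pair_sizes.values())
    weights = {
        pair: (n / total) ** (1.0 / temperature)
        for pair, n in pair_sizes.items()
    }
    norm = sum(weights.values())
    return {pair: w / norm for pair, w in weights.items()}


# Hypothetical sizes: one high-resource and one low-resource pair.
sizes = {"en-fr": 9000, "en-sw": 1000}
for t in (1.0, 2.0, 5.0):
    print(t, sampling_proportions(sizes, t))
```

With these illustrative sizes, $T = 1$ keeps the 90/10 split of the raw data, while $T = 2$ shifts it to 75/25, so the low-resource pair is seen more often at the expense of the high-resource one.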
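Transformer-based pretrained language models (LMs) are ubiquitous across natural language understanding, but cannot be applied to long sequences such as stories, scientific articles, and long documents because of their quadratic complexity. While a myriad of efficient transformer variants have been proposed, they are typically based on custom implementations that require expensive pretraining from scratch. In this work, we propose SLED: SLiding-Encoder and Decoder, a simple approach for processing long sequences that re-uses and leverages battle-tested short-text pretrained LMs. Specifically, we partition the input into overlapping chunks, encode each chunk with a short-text LM encoder, and use the pretrained decoder to fuse information across chunks (fusion-in-decoder). We illustrate through controlled experiments that SLED offers a viable strategy for long-text understanding, and we evaluate our approach on SCROLLS, a benchmark with seven datasets across a wide range of language understanding tasks. We find that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pretraining step.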
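The chunking step described above can be sketched as follows. This is only an illustration of splitting a token sequence into overlapping windows; the function name, the `chunk_size` and `overlap` parameters, and the handling of the final window are assumptions, and the encoder and fusion-in-decoder stages are omitted entirely.

```python
def overlapping_chunks(tokens, chunk_size, overlap):
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share `overlap` tokens. Hypothetical helper:
    it is not the SLED implementation, only the partitioning idea.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not tokens:
        return []
    stride = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end of the input
        start += stride
    return chunks


# Ten dummy token ids, windows of four tokens sharing two with the neighbour.
print(overlapping_chunks(list(range(10)), chunk_size=4, overlap=2))
```

Each window would then be encoded independently by the short-text encoder, so the encoding cost grows linearly with the number of windows rather than quadratically with the full input length.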
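Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.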
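NLP benchmarks have largely focused on short texts, such as sentences and paragraphs, even though long texts comprise a considerable amount of natural language in the wild. We introduce SCROLLS, a suite of tasks that require reasoning over long texts. We examine existing long-text datasets and select those in which the text is naturally long, while prioritizing tasks that involve synthesizing information across the input. SCROLLS contains summarization, question answering, and natural language inference tasks, covering multiple domains, including literature, science, business, and entertainment. Initial baselines, including Longformer Encoder-Decoder, indicate that there is ample room for improvement on SCROLLS. We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.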
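Current NLP datasets can be solved by a native speaker with relative ease. We present Cryptonite, a large-scale dataset based on cryptic crosswords, which is both linguistically complex and naturally sourced. Each example in Cryptonite is a cryptic clue, a short phrase or sentence with a misleading surface reading, whose solving requires disambiguating semantic, syntactic, and phonetic wordplays, as well as world knowledge. Cryptic clues pose a challenge even for experienced solvers, though top-tier experts can solve them with almost 100% accuracy. Cryptonite is a challenging task for current models; fine-tuning T5-Large on 470k cryptic clues achieves only 7.6% accuracy, on par with the accuracy of a rule-based clue solver (8.6%).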
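It is often desired that ordinal regression models yield unimodal predictions. However, in many recent works this characteristic is either absent or implemented using soft targets, which do not guarantee unimodal outputs at inference. In addition, we argue that the standard maximum likelihood objective is not suitable for ordinal regression problems, and that optimal transport is better suited for this task, as it naturally captures the order of the classes. In this work, we propose a framework for deep ordinal regression based on a unimodal output distribution and an optimal transport loss. Inspired by the well-known Proportional Odds model, we propose to modify its design by using an architectural mechanism that guarantees that the model output distribution will be unimodal. We empirically analyze the different components of our proposed approach and demonstrate their contribution to the performance of the model. Experimental results on eight real-world datasets demonstrate that our proposed approach consistently performs on par with, and often better than, several recently proposed deep learning approaches for deep ordinal regression with unimodal output probabilities, while providing a guarantee on output unimodality. In addition, we demonstrate that the proposed approach is less overconfident than current baselines.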
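To illustrate why optimal transport captures class order, the sketch below computes the one-dimensional earth mover's distance between two distributions over ordered classes, using the standard identity that, with unit distance between adjacent classes, it equals the sum of absolute differences of the cumulative distributions. This is an assumption made for illustration: the abstract does not specify the exact loss, and the function name is hypothetical.

```python
from itertools import accumulate


def ordinal_emd(predicted, target):
    """Earth mover's distance between two distributions over ordered classes.

    Assumes unit ground distance between adjacent classes, so the distance
    is the sum of absolute differences between the two CDFs.
    """
    if len(predicted) != len(target):
        raise ValueError("distributions must have the same number of classes")
    cdf_predicted = accumulate(predicted)
    cdf_target = accumulate(target)
    return sum(abs(p - t) for p, t in zip(cdf_predicted, cdf_target))


# True class is index 2 out of five ordered classes.
target = [0.0, 0.0, 1.0, 0.0, 0.0]
near_miss = [0.0, 1.0, 0.0, 0.0, 0.0]  # off by one class
far_miss = [0.0, 0.0, 0.0, 0.0, 1.0]   # off by two classes
print(ordinal_emd(near_miss, target), ordinal_emd(far_miss, target))
```

Both wrong predictions put zero probability on the true class, so a likelihood-based objective cannot tell them apart, whereas the transport distance grows with how far the mass sits from the true class.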
We construct a universally Bayes consistent learning rule that satisfies differential privacy (DP). We first handle the setting of binary classification and then extend our rule to the more general setting of density estimation (with respect to the total variation metric). The existence of a universally consistent DP learner reveals a stark difference with the distribution-free PAC model. Indeed, in the latter, DP learning is extremely limited: even one-dimensional linear classifiers are not privately learnable in this stringent model. Our result thus demonstrates that by allowing the learning rate to depend on the target distribution, one can circumvent the above-mentioned impossibility result and, in fact, learn \emph{arbitrary} distributions by a single DP algorithm. As an application, we prove that any VC class can be privately learned in a semi-supervised setting with a near-optimal \emph{labeled} sample complexity of $\tilde{O}(d/\varepsilon)$ (and with an unlabeled sample complexity that can depend on the target distribution).
Training a generative model on a single image has drawn significant attention in recent years. Single-image generative methods are designed to learn the internal patch distribution of a single natural image at multiple scales. These models can be used for drawing diverse samples that semantically resemble the training image, as well as for solving many image editing and restoration tasks that involve that particular image. Here, we introduce an extended framework that simultaneously learns the internal distributions of several images by using a single model with spatially varying image-identity conditioning. Our BlendGAN opens the door to applications that are not supported by single-image models, including morphing, melding, and structure-texture fusion between two or more arbitrary images.
Learned classifiers should often possess certain invariance properties meant to encourage fairness, robustness, or out-of-distribution generalization. However, multiple recent works empirically demonstrate that common invariance-inducing regularizers are ineffective in the over-parameterized regime, in which classifiers perfectly fit (i.e., interpolate) the training data. This suggests that the phenomenon of ``benign overfitting,'' in which models generalize well despite interpolating, might not favorably extend to settings in which robustness or fairness is desirable. In this work we provide a theoretical justification for these observations. We prove that -- even in the simplest of settings -- any interpolating learning rule (with arbitrarily small margin) will not satisfy these invariance properties. We then propose and analyze an algorithm that -- in the same setting -- successfully learns a non-interpolating classifier that is provably invariant. We validate our theoretical observations on simulated data and the Waterbirds dataset.
The problem of learning threshold functions is a fundamental one in machine learning. Classical learning theory implies a sample complexity of $O(\xi^{-1} \log(1/\beta))$ (for generalization error $\xi$ with confidence $1-\beta$). The private version of the problem, however, is more challenging and, in particular, the sample complexity must depend on the size $|X|$ of the domain. Progress on quantifying this dependence, via lower and upper bounds, was made in a line of works over the past decade. In this paper, we finally close the gap for approximate-DP and provide a nearly tight upper bound of $\tilde{O}(\log^* |X|)$, which matches a lower bound by Alon et al. (that applies even with improper learning) and improves over a prior upper bound of $\tilde{O}((\log^* |X|)^{1.5})$ by Kaplan et al. We also provide matching upper and lower bounds of $\tilde{\Theta}(2^{\log^*|X|})$ for the additive error of private quasi-concave optimization (a related and more general problem). Our improvement is achieved via the novel Reorder-Slice-Compute paradigm for private data analysis, which we believe will have further applications.
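For readers unfamiliar with the iterated logarithm $\log^*$ that appears in these bounds, the following minimal sketch computes it. It assumes base 2 and the usual definition (the number of times the logarithm must be applied before the value drops to at most 1); the abstract does not fix the base, and the function name is hypothetical.

```python
import math


def log_star(n):
    """Iterated logarithm, assuming base 2.

    Counts how many times log2 must be applied to n before the result
    is at most 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    count = 0
    x = n
    while x > 1:
        x = math.log2(x)
        count += 1
    return count


for size in (2, 16, 65536):
    print(size, log_star(size))
```

The function grows extremely slowly, so a dependence of $\tilde{O}(\log^* |X|)$ on the domain size is close to constant for any domain that arises in practice.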